Avoid using infiniteloop in train_cifar10_ddp.py #145
for x1, _ in tqdm.tqdm(dataloader, total=len(dataloader)):
    global_step += 1

    optim.zero_grad()
    x1 = x1.to(rank)
    x0 = torch.randn_like(x1)
    t, xt, ut = FM.sample_location_and_conditional_flow(x0, x1)
    vt = net_model(t, xt)
    loss = torch.mean((vt - ut) ** 2)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        net_model.parameters(), FLAGS.grad_clip
    )  # new
    optim.step()
    sched.step()
    ema(net_model, ema_model, FLAGS.ema_decay)  # new

    # sample and Saving the weights
    if FLAGS.save_step > 0 and global_step % FLAGS.save_step == 0:
        generate_samples(
            net_model, FLAGS.parallel, savedir, global_step, net_="normal"
        )
        generate_samples(
            ema_model, FLAGS.parallel, savedir, global_step, net_="ema"
        )
        torch.save(
            {
                "net_model": net_model.state_dict(),
                "ema_model": ema_model.state_dict(),
                "sched": sched.state_dict(),
                "optim": optim.state_dict(),
                "step": global_step,
            },
            savedir
            + f"{FLAGS.model}_cifar10_weights_step_{global_step}.pt",
        )
This chunk looks like a large change, but it only modifies where x1 is read. The rest of the diff comes from the indentation change, since step_pbar is no longer needed.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #145       +/-   ##
===========================================
+ Coverage   34.23%   45.22%   +10.98%
===========================================
  Files          55       12      -43
  Lines        6268     1130    -5138
===========================================
- Hits         2146      511    -1635
+ Misses       4122      619    -3503
Hi @Xiaoming-Zhao,

Thanks for this PR!

1.) I'm pretty sure that using
I wrote a small script that I uploaded on my website (https://imahnshekhzadeh.github.io/#Blog), which uses an

torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE test_inf_loop.py --master_addr [...] --master_port [...]
# e.g.: `torchrun --nproc_per_node=2 test_inf_loop.py --master_addr [...] --master_port [...]`

When

Rank: 1, Epoch: 0, Batch: 2, Data:
[tensor([[-1.3042, -1.1097],
        [-0.1320, -0.2751]])]
Rank: 1, Epoch: 1, Batch: 2, Data:
[tensor([[-0.1752, 0.6990],
        [-0.2350, 0.0937]])]

So clearly,

Clearly, no shuffling happened!

2.) About the
Thanks for the detailed example, @ImahnShekhzadeh! This is incredibly helpful. Lessons learned.

I will close this PR for now as it seems like all required changes have been implemented in #116.

Regarding the
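For readers following the shuffling discussion: parts of the comments above are truncated, so this is only a sketch of the general effect, assuming the point concerns per-epoch reshuffling. PyTorch's DistributedSampler documents that set_epoch() has to be called at the start of each epoch, otherwise the same ordering is reused every epoch. The names ToyEpochSampler and orders below are hypothetical, pure-Python stand-ins that mimic this seed-plus-epoch behaviour. They are not the script from the comment and not a PyTorch API.

```python
import random
from typing import Iterator, List


class ToyEpochSampler:
    """Hypothetical stand-in for an epoch-aware distributed sampler.

    Assumption: the shuffle is seeded with `seed + epoch`, so the order only
    changes between epochs if `set_epoch` is called.
    """

    def __init__(self, num_samples: int, num_replicas: int, rank: int, seed: int = 0):
        self.num_samples = num_samples
        self.num_replicas = num_replicas
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch

    def __iter__(self) -> Iterator[int]:
        # Every rank builds the same permutation, then takes its own slice.
        rng = random.Random(self.seed + self.epoch)
        indices = list(range(self.num_samples))
        rng.shuffle(indices)
        return iter(indices[self.rank :: self.num_replicas])


def orders(call_set_epoch: bool, num_samples: int = 64, num_epochs: int = 2) -> List[List[int]]:
    """Return the index order seen by rank 1 (of 2) in each epoch."""
    sampler = ToyEpochSampler(num_samples=num_samples, num_replicas=2, rank=1)
    per_epoch = []
    for epoch in range(num_epochs):
        if call_set_epoch:
            sampler.set_epoch(epoch)
        per_epoch.append(list(sampler))
    return per_epoch


if __name__ == "__main__":
    print("same order every epoch without set_epoch:", orders(False)[0] == orders(False)[1])
    print("same order every epoch with set_epoch:   ", orders(True)[0] == orders(True)[1])
```

With an infinite generator wrapped around the dataloader, the training loop never sees an epoch boundary, so there is no natural place for such a per-epoch call.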
What does this PR do?
This PR avoids using the infinite generator provided by infiniteloop and directly uses the dataloader instead, as discussed in #144. This change follows the structure provided by pytorch.

I tested the change locally and made sure that it runs smoothly.
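The structural difference can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual training script: the infiniteloop function here is a stand-in written for this example (the repository's own implementation is not shown in this PR), and train_with_infiniteloop, train_with_epochs, and the integer "batches" are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def infiniteloop(batches: Iterable[int]) -> Iterator[int]:
    """Stand-in infinite generator: re-iterates `batches` forever.

    The caller only ever sees a flat stream of batches, so epoch
    boundaries are hidden from the training loop.
    """
    while True:
        for batch in batches:
            yield batch


def train_with_infiniteloop(batches: List[int], total_steps: int) -> List[int]:
    """Step-driven loop: pull `total_steps` batches from the infinite stream."""
    datalooper = infiniteloop(batches)
    return [next(datalooper) for _ in range(total_steps)]


def train_with_epochs(batches: List[int], num_epochs: int) -> List[Tuple[int, int, int]]:
    """Epoch-driven loop: iterate the (stand-in) dataloader directly."""
    seen = []
    global_step = 0
    for epoch in range(num_epochs):
        # Per-epoch hooks can go here, e.g. a sampler's set_epoch(epoch)
        # call when training with DDP (assumption, not taken from the diff).
        for batch in batches:
            global_step += 1
            seen.append((epoch, global_step, batch))
    return seen


if __name__ == "__main__":
    print(train_with_infiniteloop([10, 20, 30], total_steps=7))
    print(train_with_epochs([10, 20, 30], num_epochs=2))
```

Both loops visit the same batches. The epoch-driven version keeps the epoch index available inside the loop, which the step-driven version does not.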
I also added a --standalone command line argument to the README, without which I could not get the script to run. This argument is also used in the official example for single-node usage.

Before submitting
pytest command?
pre-commit run -a command?

Did you have fun?
Make sure you had fun coding 🙃